Outline

  1. Make a data folder

  2. Drag favorability.csv into the data folder

  3. Make the existing folder an RStudio project

  4. Open an R Markdown Notebook

  5. library(tidyverse) plus other libraries

  6. IMPORT data (see also the RStudio data import wizard)

  7. ATTACH data

  8. EDA: Visualize ggplot(data = starwars, aes(hair_color)) + geom_bar()

  9. EDA: skimr::skim(starwars)

  10. EDA: summary(favorability)

  11. left_join(starwars, fivethirtyeight)

  12. Transform data: five dplyr verbs …

    • count / group_by & summarize
  13. Interactive visualization ggplotly

  14. Quick Linear Regression

  15. Reports: notebooks, slides, dashboards, word document, PDF, book, etc.


5. library(tidyverse) plus other libraries

library(tidyverse)
library(skimr)
library(plotly)
library(moderndive)
# library(broom)

6. read_csv("favorability.csv") (see also the data import wizard)

## fav_data <- read_csv("data/fav.csv")
favorability <- read_csv("https://raw.githubusercontent.com/libjohn/intro2r-code/master/data/538_favorability_popularity.csv", skip = 11)
Parsed with column specification:
cols(
  name = col_character(),
  fav_rating = col_double()
)

7. Attach on-board data

dplyr::starwars

data("starwars")

8. Quick visualization

Visualize with the ggplot2 library.

plot <- ggplot(data = starwars, aes(hair_color)) + 
  geom_bar()
plot

One improvement

Arrange bars by frequency using forcats::fct_infreq().

plot1 <- ggplot(data = starwars, aes(fct_infreq(hair_color))) + 
  geom_bar()
plot1

9. skimr::skim(starwars)

The skimr library presents summary EDA results using the skim() function

skim(starwars)
-- Data Summary ------------------------
                           Values  
Name                       starwars
Number of rows             87      
Number of columns          14      
_______________________            
Column type frequency:             
  character                8       
  list                     3       
  numeric                  3       
________________________           
Group variables            None    

-- Variable type: character --------------------------------------------------------------------------------
# A tibble: 8 x 8
  skim_variable n_missing complete_rate   min   max empty n_unique whitespace
* <chr>             <int>         <dbl> <int> <int> <int>    <int>      <int>
1 name                  0         1         3    21     0       87          0
2 hair_color            5         0.943     4    13     0       12          0
3 skin_color            0         1         3    19     0       31          0
4 eye_color             0         1         3    13     0       15          0
5 sex                   4         0.954     4    14     0        4          0
6 gender                4         0.954     8     9     0        2          0
7 homeworld            10         0.885     4    14     0       48          0
8 species               4         0.954     3    14     0       37          0

-- Variable type: list -------------------------------------------------------------------------------------
# A tibble: 3 x 6
  skim_variable n_missing complete_rate n_unique min_length max_length
* <chr>             <int>         <dbl>    <int>      <int>      <int>
1 films                 0             1       24          1          7
2 vehicles              0             1       11          0          2
3 starships             0             1       17          0          5

-- Variable type: numeric ----------------------------------------------------------------------------------
# A tibble: 3 x 11
  skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100 hist 
* <chr>             <int>         <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 height                6         0.931 174.   34.8    66 167     180 191     264 ▁▁▇▅▁
2 mass                 28         0.678  97.3 169.     15  55.6    79  84.5  1358 ▇▁▁▁▁
3 birth_year           44         0.494  87.6 155.      8  35      52  72     896 ▇▁▁▁▁

10. summary

summary(favorability)
     name             fav_rating   
 Length:14          Min.   :110.0  
 Class :character   1st Qu.:148.5  
 Mode  :character   Median :392.0  
                    Mean   :369.0  
                    3rd Qu.:559.5  
                    Max.   :610.0  

11. left_join(starwars, fivethirtyeight)

Joins or merges are part of the dplyr library.

starwars %>% 
  left_join(favorability, by = "name") %>% 
  select(name, fav_rating, everything()) %>% 
  arrange(-fav_rating)

12. Transform data:

From the dplyr library, use the five verbs …

select to subset data by columns

starwars %>% 
  select(name, gender, hair_color)

filter to subset data rows

starwars %>% 
  filter(gender == "feminine")

arrange to sort data

starwars %>% 
  arrange(desc(height), desc(name))

mutate to add new variable or transform existing

starwars %>%
  drop_na(mass) %>% 
  select(name, mass) %>% 
  mutate(big_mass = mass * 2)

count / group_by & summarize

subtotals of variables

starwars %>% 
  count(gender)

Variable totals (summarise can also perform other calculations, not shown here)

starwars %>% 
  drop_na(mass) %>% 
  summarise(sum(mass))

Variable subtotals and calculations

group_by(gender, species) %>% summarise(mean_height = mean(height), total = n())

starwars %>% 
  drop_na(height) %>% 
  group_by(gender, species) %>% 
  summarise(mean_height = mean(height), total = n()) %>% 
  arrange(desc(total)) %>%
  drop_na(species) %>%
  filter(total > 1) %>% 
  select(species, gender, total, everything())

13. Interactive visualization

Use ggplotly() from the plotly library to make an existing ggplot interactive.

ggplotly(plot1)

14. Regression / models

Predict mass from height after removing Jabba from the data set. Here we use mostly base R, with moderndive for model outputs, the tidyverse pipe %>%, and dplyr for data transformations. Alternatively, the broom library can be used to work with model objects.

model <- lm(mass ~ height, data = starwars %>% filter(mass < 500))
model

Call:
lm(formula = mass ~ height, data = starwars %>% filter(mass < 
    500))

Coefficients:
(Intercept)       height  
   -32.5408       0.6214  
summary(model)

Call:
lm(formula = mass ~ height, data = starwars %>% filter(mass < 
    500))

Residuals:
    Min      1Q  Median      3Q     Max 
-39.382  -8.212   0.211   3.846  57.327 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -32.54076   12.56053  -2.591   0.0122 *  
height        0.62136    0.07073   8.785 4.02e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 19.14 on 56 degrees of freedom
Multiple R-squared:  0.5795,    Adjusted R-squared:  0.572 
F-statistic: 77.18 on 1 and 56 DF,  p-value: 4.018e-12

A nice Explanation of Basic Regression can be found in chapter 5 of the book Statistical Inference via Data Science. You can also use the moderndive package to access helpful functions such as get_correlation(), get_regression_table(), etc.

You may also appreciate or prefer the broom package for the very nice tidy(), glance(), and augment() functions.

starwars %>% 
  filter(mass < 500) %>% 
  get_correlation(mass ~ height)
# tidy(model)
get_regression_table(model)
# broom::glance(model)
get_regression_summaries(model)
# broom::augment(model)
get_regression_points(model)

Visualize regression

mass over height with a fitted linear regression line and confidence interval using geom_smooth()

starwars %>% 
  filter(mass < 500) %>%
  ggplot(aes(height, mass)) +
  geom_jitter() +
  geom_smooth(method = "lm")

15. Render reports

By changing the output argument in the YAML header, you can render many report styles. A few popular examples include:

type | YAML syntax | More information
--- | --- | ---
notebook (alpha or dev) | output: html_notebook | Notebook
notebook (final or prod) | output: html_document | HTML document
Word document | output: word_document | MS Word
slide deck | See Get Started | Xaringan
dashboards | | flexdashboard
e-book / web-book | | Bookdown
website | | Blogdown
website (simpler) | | Create a website / Distill
PDF | output: pdf_document | PDF document

---
title: "Quickstart demo"
author: "John Little"
date: "`r Sys.Date()`"
output: html_notebook
---

## Outline

1. Make a data folder
1. Drag favorability.csv into the data folder
1. Make the existing folder an RStudio project
1. Open an R Markdown Notebook
1. `library(tidyverse)` plus other libraries
1. IMPORT data (see also the _RStudio data import wizard_)
1. ATTACH data
1. EDA: Visualize `ggplot(data = starwars, aes(hair_color)) + geom_bar()` 
1. EDA: `skimr::skim(starwars)`
1. EDA: `summary(favorability)`
1. `left_join(starwars, fivethirtyeight)`
1. Transform data: five dplyr verbs ...

    - `count` / `group_by` & `summarize`
        
1. Interactive visualization ggplotly
1. Quick Linear Regression
1. Reports:  notebooks, slides, dashboards, word document, PDF, book, etc.

---

## 5. `library(tidyverse)` plus other libraries

```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(skimr)
library(plotly)
library(moderndive)
library(broom)
```

## 6. `read_csv("favorability.csv")` (see also the data import wizard)

```{r}
## fav_data <- read_csv("data/fav.csv")
favorability <- read_csv("https://raw.githubusercontent.com/libjohn/intro2r-code/master/data/538_favorability_popularity.csv", skip = 11)
```
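
The `skip` argument tells `read_csv()` to ignore a fixed number of lines before the header row, which is useful when a file begins with notes or metadata. The following is a minimal sketch of that behavior on made-up inline text, not on the real favorability file. The names `raw_text` and `demo` and the file contents are hypothetical, and `I()` assumes readr 2.0 or later.

```{r}
library(readr)

# Hypothetical file contents: two note lines precede the header row
raw_text <- "note: made-up example data\nnote: not the real file\nname,fav_rating\nA,10\nB,20\n"

# skip = 2 drops the two note lines so the third line becomes the header;
# I() marks the string as literal data rather than a file path
demo <- read_csv(
  I(raw_text),
  skip = 2,
  col_types = cols(name = col_character(), fav_rating = col_double())
)
demo
```

Supplying `col_types` is optional; it makes the column specification explicit instead of relying on guessing.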


## 7. Attach on-board data

`dplyr::starwars` is a data set that ships with the `dplyr` package.

```{r}
data("starwars")
```

## 8. Quick visualization

Visualize with the `ggplot2` library.

```{r}
plot <- ggplot(data = starwars, aes(hair_color)) + 
  geom_bar()
plot
```

### One improvement

Arrange bars by frequency using `forcats::fct_infreq()`.


```{r}
plot1 <- ggplot(data = starwars, aes(fct_infreq(hair_color))) + 
  geom_bar()
plot1
```
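
To see what `fct_infreq()` does to the bar order, it can help to look at factor levels directly. This is a small sketch on made-up values; the vector `hair` is hypothetical and is not taken from `starwars`.

```{r}
library(forcats)

# Made-up values: "brown" appears 3 times, "black" twice, "none" once
hair <- factor(c("brown", "black", "brown", "none", "brown", "black"))

levels(hair)              # default order is alphabetical
levels(fct_infreq(hair))  # reordered from most to least frequent
```

Because `geom_bar()` draws bars in level order, reordering the levels is what reorders the bars.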


## 9. `skimr::skim(starwars)`

The `skimr` library presents summary EDA results using the `skim()` function 

```{r}
skim(starwars)
```

## 10. summary

```{r}
summary(favorability)
```

## 11. `left_join(starwars, fivethirtyeight)`

Joins or merges are part of the `dplyr` library.

```{r}
starwars %>% 
  left_join(favorability, by = "name") %>% 
  select(name, fav_rating, everything()) %>% 
  arrange(-fav_rating)
```
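
A left join keeps every row of the left-hand table and fills in `NA` where the right-hand table has no match. The sketch below illustrates this on two hypothetical toy tables (`characters` and `ratings` are invented for the example, not the workshop data), and uses `anti_join()` to list the rows that found no match.

```{r}
library(dplyr)

# Hypothetical toy tables, not the workshop data
characters <- tibble(name = c("Leia", "Han", "Chewbacca"))
ratings <- tibble(name = c("Leia", "Han"), rating = c(10, 20))

# left_join() keeps all rows of `characters`; Chewbacca has no match, so rating is NA
joined <- left_join(characters, ratings, by = "name")
joined

# anti_join() returns the rows of `characters` with no match in `ratings`
unmatched <- anti_join(characters, ratings, by = "name")
unmatched
```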


## 12. Transform data: 

From the `dplyr` library, use the five  verbs ...

### `select` to subset data by columns

```{r}
starwars %>% 
  select(name, gender, hair_color)
```

### `filter` to subset data rows 

```{r}
starwars %>% 
  filter(gender == "feminine")
```

### `arrange` to sort data

```{r}
starwars %>% 
  arrange(desc(height), desc(name))
```

### `mutate` to add new variable or transform existing

```{r}
starwars %>%
  drop_na(mass) %>% 
  select(name, mass) %>% 
  mutate(big_mass = mass * 2)
```


### `count` / `group_by` & `summarize`

subtotals of variables


```{r}
starwars %>% 
  count(gender)
```
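
`count()` can be read as shorthand for grouping and then summarising with `n()`. A minimal sketch on a made-up tibble (`toy` is hypothetical) shows that both routes give the same counts:

```{r}
library(dplyr)

# Made-up data: group "a" occurs three times, group "b" once
toy <- tibble(group = c("a", "b", "a", "a"))

by_count <- toy %>% count(group)
by_summarise <- toy %>% group_by(group) %>% summarise(n = n())

by_count
identical(by_count$n, by_summarise$n)
```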

Variable totals (`summarise` can also perform other calculations, not shown here)

```{r}
starwars %>% 
  drop_na(mass) %>% 
  summarise(sum(mass))
```

Variable subtotals and calculations

> `group_by(gender, species) %>% 
   summarise(mean_height = mean(height), total = n())`

```{r message=FALSE, warning=FALSE}
starwars %>% 
  drop_na(height) %>% 
  group_by(gender, species) %>% 
  summarise(mean_height = mean(height), total = n()) %>% 
  arrange(desc(total)) %>%
  drop_na(species) %>%
  filter(total > 1) %>% 
  select(species, gender, total, everything())
```
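
When `summarise()` follows a `group_by()` on more than one variable, recent versions of dplyr print a message about how the result remains grouped, which is presumably why the chunk above sets `message=FALSE`. As an alternative sketch, assuming dplyr 1.0 or later, the `.groups` argument states the intent explicitly. The tibble `toy_scores` and its columns are hypothetical.

```{r}
library(dplyr)

# Made-up data with two grouping variables
toy_scores <- tibble(
  team  = c("x", "x", "y", "y"),
  role  = c("a", "a", "a", "b"),
  score = c(1, 3, 10, 20)
)

# .groups = "drop" returns an ungrouped tibble and avoids the regrouping message
toy_summary <- toy_scores %>%
  group_by(team, role) %>%
  summarise(mean_score = mean(score), total = n(), .groups = "drop")

toy_summary
is_grouped_df(toy_summary)
```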

## 13. Interactive visualization

Use `ggplotly()` from the `plotly` library to make an existing ggplot interactive.


```{r}
ggplotly(plot1)
```

## 14. Regression / models

Predict mass from height after removing Jabba from the data set. Here we use mostly base R, with `moderndive` for model outputs, the tidyverse pipe `%>%`, and `dplyr` for data transformations. Alternatively, the `broom` library can be used to work with model objects.

```{r}
model <- lm(mass ~ height, data = starwars %>% filter(mass < 500))
model
summary(model)
```

A nice _Explanation of Basic Regression_ can be found in [chapter 5](https://moderndive.com/5-regression.html) of the book [_Statistical Inference via Data Science_](https://moderndive.com/). You can also use the `moderndive` package to access helpful functions such as `get_correlation()`, `get_regression_table()`, etc.

You may also appreciate or prefer the [broom](https://broom.tidyverse.org) package for the very nice `tidy()`, `glance()`, and `augment()` functions.

```{r}
starwars %>% 
  filter(mass < 500) %>% 
  get_correlation(mass ~ height)
```


```{r}
# tidy(model)
get_regression_table(model)
```

```{r}
# broom::glance(model)
get_regression_summaries(model)
```


```{r}
# broom::augment(model)
get_regression_points(model)
```
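
Once the model is fitted, base R's `predict()` turns it into an estimate for a new observation. The sketch below refits the same model so that it is self-contained; `new_character` and the height of 180 are hypothetical values chosen for illustration.

```{r}
library(dplyr)

# Same model as above: mass ~ height, with the one very heavy character filtered out
model <- lm(mass ~ height, data = starwars %>% filter(mass < 500))

# Hypothetical new observation: a character 180 cm tall
new_character <- tibble(height = 180)
predicted_mass <- predict(model, newdata = new_character)
predicted_mass

# The prediction equals intercept + slope * height
by_hand <- coef(model)[["(Intercept)"]] + coef(model)[["height"]] * 180
all.equal(unname(predicted_mass), by_hand)
```

Predictions for heights far outside the range observed in the data should be treated with caution.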

### Visualize regression

`mass` over `height` with a fitted linear regression line and confidence interval using `geom_smooth()`

```{r}
starwars %>% 
  filter(mass < 500) %>%
  ggplot(aes(height, mass)) +
  geom_jitter() +
  geom_smooth(method = "lm")
```

## 15. Render reports

By changing the `output` argument in the YAML header, you can render many report styles. A few popular examples include:

type | YAML syntax | More information
--- | --- | ---
notebook (alpha or dev) | output: html_notebook | [Notebook](https://bookdown.org/yihui/rmarkdown/notebook.html)
notebook (final or prod) | output: html_document | [HTML document](https://bookdown.org/yihui/rmarkdown/html-document.html)
Word document | output: word_document | [MS Word](https://bookdown.org/yihui/rmarkdown/word-document.html)
slide deck | See [Get Started](https://slides.yihui.org/xaringan/#4) | [Xaringan](https://slides.yihui.org/xaringan/#1)
dashboards | &nbsp; | [flexdashboard](https://rmarkdown.rstudio.com/flexdashboard/), [shiny](https://shiny.rstudio.com/)
e-book / web-book | &nbsp; | [Bookdown](https://bookdown.org/yihui/bookdown/)
website | &nbsp; | [Blogdown](https://bookdown.org/yihui/blogdown/)
website (simpler) | | [Create a website / Distill](https://rstudio.github.io/distill/website.html)
PDF | output: pdf_document | [PDF document](https://bookdown.org/yihui/rmarkdown/pdf-document.html)
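
Besides editing the YAML header, a format can also be chosen at render time with `rmarkdown::render()`. This is a sketch only: the file name `quickstart.Rmd` is hypothetical, and the chunk is marked `eval=FALSE` so that it is not run when this notebook itself is rendered.

```{r eval=FALSE}
library(rmarkdown)

# Hypothetical file name: replace with the path to your own notebook
input_file <- "quickstart.Rmd"

# Guarded so the sketch does nothing when the file is absent;
# output_format overrides the output field of the YAML header
if (file.exists(input_file)) {
  render(input_file, output_format = "word_document")
}
```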



